Welcome to my spot on the web for drafts, supplemental material, and general thoughts about doing a thesis project
for the Master of Science in Predictive Analytics (MSPA) degree (now the Master's in Data Science (MSDS) program)
from Northwestern University. Below the interactive plots, I'm developing a sort of "epilogue" containing
thoughts about doing a data science Master's, choosing the thesis option, and some of the things I've learned
along the way.
Thesis Paper
I'll update this section with drafts as they get finished.
2018-11-04: I have a (mostly) completed draft you can check out here on Google Drive.
I'm currently awaiting comments from readers, so no doubt it will change substantially. I haven't put in
a Table of Contents yet, and I'm still figuring out how to list the supplemental materials you'll find on this page,
but everything else is there (hooray!).
Below are four interactive multidimensional scaling plots of genetic profiles developed from open-source RNA-seq
data available from the Aging, Dementia, and TBI Study
from the Allen Institute for Brain Science.
Use your mouse to grab them, rotate them, and zoom in and out. Hovering over a data point gives the point's coordinates in the first three MDS dimensions. Each point
represents a genetic profile (based on expression levels for 50,000+ genes and gene isoforms) for an individual patient/donor.
These were made using Plotly and htmlwidgets
for R. Check out this blog post
for more on multidimensional scaling of gene expression level data.
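For readers curious about what MDS does under the hood, here is a minimal sketch in Python with NumPy. It assumes classical (Torgerson) MDS on Euclidean distances, which may not match the exact variant or distance measure behind the plots on this page (those were made in R). The function name `classical_mds` and the random toy "profiles" are made up for illustration and are not the study data.

```python
import numpy as np


def classical_mds(distances, n_components=3):
    """Embed points in n_components dimensions from a pairwise distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to get a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order, so pick the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues that come from floating-point noise.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy stand-in for expression profiles: 6 "donors" x 50 "genes" (synthetic).
rng = np.random.default_rng(0)
profiles = rng.normal(size=(6, 50))
diffs = profiles[:, None, :] - profiles[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

coords = classical_mds(distances, n_components=3)
print(coords.shape)  # (6, 3)
```

Each row of `coords` plays the role of one point in a 3D plot, and its three columns are the kind of coordinates you see when hovering over a point.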
Shaded by Brain Region
HIP = hippocampus
FWM = forebrain white matter
PCx = parietal cortex
TCx = temporal cortex
Shaded by Donor Sex
Shaded by Lifetime Number of Traumatic Brain Injuries (TBIs)
A comparison of the numbers of "significant" genes obtained with different filtering parameters and
p-value cutoffs for determining differential expression in donors with dementia.
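The idea behind a comparison like this can be sketched in a few lines of Python: count how many genes pass both an expression filter and a p-value cutoff, then repeat for each combination of settings. Everything here is an assumption for illustration: the gene records are invented, the filter is a simple minimum mean count, and the names `toy_results` and `count_significant` are hypothetical. The actual filtering parameters used in the thesis are not shown on this page.

```python
# Invented per-gene results; NOT values from the thesis or the study data.
toy_results = [
    {"gene": "g1", "mean_count": 120.0, "p_value": 0.0004},
    {"gene": "g2", "mean_count": 3.0, "p_value": 0.0300},
    {"gene": "g3", "mean_count": 45.0, "p_value": 0.0200},
    {"gene": "g4", "mean_count": 8.0, "p_value": 0.0009},
    {"gene": "g5", "mean_count": 60.0, "p_value": 0.2000},
]


def count_significant(results, min_mean_count, p_cutoff):
    """Count genes that pass the expression filter and the p-value cutoff."""
    return sum(
        1
        for r in results
        if r["mean_count"] >= min_mean_count and r["p_value"] < p_cutoff
    )


# Tabulate the count for every combination of filter and cutoff.
counts = {}
for min_mean_count in (1, 10):
    for p_cutoff in (0.05, 0.001):
        counts[(min_mean_count, p_cutoff)] = count_significant(
            toy_results, min_mean_count, p_cutoff
        )

for (min_mean_count, p_cutoff), n in counts.items():
    print(f"min mean count {min_mean_count}, p < {p_cutoff}: {n} genes")
```

Stricter settings can only shrink the list, which is why the number of "significant" genes depends so heavily on the choices made before testing.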
As a part of the exploratory analysis of the RNA-seq transcriptome data, I investigated the 29 genes that had
altered expression patterns in all four brain regions sampled from donors with dementia
(hippocampus, forebrain white matter, parietal cortex, and temporal cortex).
Things I've Learned by Doing a Data Science Master's Thesis
As things start to wrap up for me, I'm finding myself reflecting on the entire experience of doing the MSPA program.
Maybe you stumbled onto this page because you're thinking of pursuing a data science Master's
degree. Or maybe you're already in the MSDS program at Northwestern or somewhere else and are trying to
make the "thesis or capstone" decision. In this section, I'll be keeping a list of some of the things
I've learned from doing this degree with a focus on doing a thesis project. Just my $0.02. FWIW, etc. I'm
putting it down here as a sort of epilogue to the thesis once she's all done.
“Life can only be understood backwards; but it must be lived forwards.” - Kierkegaard
Doing this program was a great decision for me. As someone moving from academia to industry
AND changing careers, the ability to talk to and learn from people already doing data science in
a variety of industries was exactly what I needed. Classes were challenging and I appreciated the flexibility
of an entirely online program. My classmates are incredible people. I learned so much from interacting with
them and our instructors as well. The structure of an actual academic program was good for me because it
kept me on track and provided me with a level of accountability, ensuring that I was learning what I needed to
learn. Your mileage may vary but, for me, it was well worth the investment I made (more specifics on that soon 😉).
You get out of it what you put into it. Most educational experiences are like this, I bet. I'm
not saying anything here you probably don't already know. I figure if you (or your company) are going to
drop a lot of coin on a program like this, why not go the extra mile, if you can? Show up. Be creative. And don't be
afraid to come in last in your class in a Kaggle InClass competition 😉 It just might be the best
thing that ever happens to you.
Got time and an idea? Not doing data science for a living yet? Do the thesis. I have learned
more in the past year of self-study for the thesis project than I ever thought I would. I grok
more about statistics, clustering, penalized linear models, binary classifiers, and so much more now for
having done this thing. I feel like I can talk about those things and be confident in what I'm saying.
Personally, I learn best by doing, screwing it up, doing it over, screwing it up some more, etc. If you're
not doing data science for a living yet, and don't have the opportunity to work with real data on the reg, the
thesis project can be a terrific way to get an understanding beyond the Titanic and MNIST.
BUT, doing the thesis will take a long time. Maybe not for some people, but on average it does
take longer than a quarter. Maybe two. I gave myself a year to do it with everything else going on in my life
and it could take longer. But for me it is worth it. You'll have to weigh the options for yourself.
The Northwestern MSDS Canvas site has resources to help you decide if doing a thesis is for you. Also,
Dr. Alianna Maren has a very honest flowchart
for making the decision that you can check out. Be prepared to work independently but don't pass up
opportunities to use University resources like The Writing Place.
You/your company are, after all, paying for them 😊
If you have the time, blog about your journey. I'm writing this in HTML right now, something I
never thought I'd learn over the course of doing the program/pivoting to a career in data science.
I'm so glad I discovered GitHub Pages and set up this website because I've learned so much bonus stuff
in the process. A little web design. A little CSS and HTML. Even a little Ruby. It's a place to showcase
your work and maybe (hopefully!) interact with others. And speaking of GitHub...
GitHub is amazing. Get an account, if only to share code with your classmates, but it is so much
more than that. Software developers moving to data science already know about this thing but I had no idea.
I had an epiphany back in about April/May 2018 when I learned a little about how GitHub is actually
used to organize projects and stuff. I mean...
I changed all the scripts I had written for the thesis project up to that point so that, when I finished
the project, anybody could clone the project repository, run the scripts in order, and reproduce any of the
results I put in my thesis or on this blog. I mean, wow. It blew my mind when I discovered that. Open source is
amazeballs. Many of our favorite R and Python packages are developed there and we can be a part of them.
How cool! I have the zeal of the converted but seriously. Make use of GitHub. Especially if you're doing a
thesis project that you can share with others/point potential employers towards if you're on the hunt for
a new gig.
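If you want a feel for what "clone it and run the scripts in order" can look like, here is a minimal sketch in Python. It is not the setup from my thesis repository. It assumes a hypothetical convention where scripts carry numeric prefixes (`01_`, `02_`, ...) so that sorting by file name gives the run order, and the function names `ordered_scripts` and `run_pipeline` are made up for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def ordered_scripts(directory):
    """Return the .py scripts in `directory`, sorted by file name."""
    return sorted(Path(directory).glob("*.py"))


def run_pipeline(directory):
    """Run each script in name order; check=True stops at the first failure."""
    completed = []
    for script in ordered_scripts(directory):
        subprocess.run([sys.executable, script.name], cwd=directory, check=True)
        completed.append(script.name)
    return completed


# Demo with two throwaway scripts that each append a line to log.txt.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "02_model.py").write_text("open('log.txt', 'a').write('model\\n')\n")
    Path(tmp, "01_clean.py").write_text("open('log.txt', 'a').write('clean\\n')\n")
    ran = run_pipeline(tmp)
    log_lines = Path(tmp, "log.txt").read_text().splitlines()

print(ran)
print(log_lines)
```

The numeric prefixes do the real work here: anyone who clones the repository can see the intended order just by listing the directory.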